ABSTRACT


This report describes, with examples, some of the effects on model
formulation and model applicability of differences in the aggregation
levels among various data sources (primarily populations).
EFFECTS OF DATA AGGREGATION IN MODELLING


Three of the most common problems related to aggregation of data
sources are 1) possible nonlinearity of the relationship between
dependent and independent variables, 2) nonlinearities in the underlying
relationships between populations and the characteristics which are to
be used as the independent variables, and 3) homogeneities induced by
grouping differing populations together.

The effect of nonlinearities in the relationship between dependent
and independent variables is that a model constructed at a given
aggregation level cannot be applied at other levels. The effect of
nonlinearities in the underlying relationships is that the relationships
between the "dependent" and "independent" variables will be elusive, and
the effect of homogeneities is that identifying differences in the
"independent" variables may be suppressed entirely, making model
construction impossible.

A simple example of nonlinearity will serve to illustrate the first
problem. Suppose that the number of telephone calls (T) per day in a city
is quadratically related to its population (P). (This is more reasonable
than to assume that the relationship is linear, i.e., proportional to the
population, because the number of pairs of people increases faster than
the number of people, and a phone call is a pair phenomenon.) We would
then have \(T = CP^2\), where C is a scaling constant. Now if a pair of
neighboring cities are considered as a single aggregated population
according to some criterion, e.g., Minneapolis - St. Paul, San Francisco -
Oakland, Philadelphia - Camden, etc., then applying this model we would
have

\[ T = C(P_1 + P_2)^2 = C(P_1^2 + P_2^2) + 2CP_1P_2, \]

a number which could be considerably larger than
\(T = C(P_1^2 + P_2^2)\), calculated as a sum of the values for the pair
of locations separately.

If the relationship were linear, i.e., T = CP, then areas could be
aggregated without penalty. This means that a nonlinear model constructed
at any given level cannot be applied without carefully ascertaining that
the regions to which it is applied really are equivalent demographically
to those from which the model was deduced.
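The aggregation penalty in the telephone example can be checked numerically. The following is a minimal sketch only: the function names, the choice C = 1, and the two city populations are illustrative assumptions and do not come from the report.

```python
def quadratic_calls(population, c=1):
    """Daily calls under the quadratic model T = C * P**2."""
    return c * population ** 2


def linear_calls(population, c=1):
    """Daily calls under the linear model T = C * P."""
    return c * population


def aggregation_gap(model, p1, p2, c=1):
    """Model applied to the merged population, minus the sum of the
    model applied to each city separately."""
    return model(p1 + p2, c) - (model(p1, c) + model(p2, c))


# Hypothetical populations for a pair of neighboring cities.
p1, p2 = 400_000, 300_000

# For the quadratic model the gap is the cross term 2 * C * p1 * p2.
print(aggregation_gap(quadratic_calls, p1, p2))

# For the linear model the gap vanishes: aggregation carries no penalty.
print(aggregation_gap(linear_calls, p1, p2))
```

With integer inputs the arithmetic is exact, so the quadratic gap can be compared directly with the cross term 2CP1P2.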

Problems 2 and 3 affect the character of a model that may be evolved
from a data base, and in fact overlap, so that they can be discussed
together.

In general, the perceived character of an observed population is
strongly affected by the way the boundaries for the observations are
delineated. Sample data for use in constructing models should therefore
be gathered with some circumspection.

Median values and rankings are parameters whose magnitudes depend on
the level of aggregation of described populations. For a simple example
consider a population P of 6 people a1, a2, a3, b1, b2, b3 with incomes
given below and grouped as shown into two subgroups (A & B) with three
members each:

    a1 : $1000
    a2 : $2000
    a3 : $12000
    b1 : $4000
    b2 : $5000
    b3 : $6000
The average incomes of P, A and B are all $5000. The median income
of A is $2000, of B, $5000 and of P, $4500, i.e., the median income of
the aggregated subgroups cannot be derived from their individual medians.
As for the effect of rankings, if we stratify incomes by thousands of
dollars, the fractions of the populations of A and B in the lowest stratum
are both 1/3 (1/3 of the people in B are in the lowest stratum for that
subgroup: $4000), whereas the fraction of the population of P in the
lowest stratum is 1/6. For this kind of parameter even a linear model will
fail under aggregation because the parameters themselves are nonlinearly
related to the populations which they describe.
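The figures quoted for the six-person example can be reproduced directly. The sketch below uses the incomes listed above. The helper name lowest_stratum_fraction is hypothetical, and "lowest stratum" is read here as the lowest $1000-wide stratum that a group actually occupies, which is an interpretation of the text.

```python
from fractions import Fraction
from statistics import mean, median

# Incomes from the example: subgroups A and B, and their union P.
A = [1000, 2000, 12000]
B = [4000, 5000, 6000]
P = A + B


def lowest_stratum_fraction(incomes, width=1000):
    """Fraction of a group falling in the lowest stratum that the group
    itself occupies, with strata of the given width in dollars."""
    lowest = min(incomes) // width
    members = sum(1 for income in incomes if income // width == lowest)
    return Fraction(members, len(incomes))


for name, group in (("A", A), ("B", B), ("P", P)):
    print(name, mean(group), median(group), lowest_stratum_fraction(group))
```

The means agree across A, B and P, while the median and the lowest-stratum fraction of P cannot be obtained by averaging the corresponding values for A and B.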

Phenomena depending on concentration level are frequently encountered,
typically in epidemiological studies. A simple example here is given by
two populations of families with the same total number of young children,
in which the children are distributed one per family in the first
population, and three in every third family in the second. A survey, for
instance, of the consumption of children's clothing would probably find
higher per capita consumption in the first population than in the second
because of the use of hand-me-downs in the second group, even though the
mean statistics of the two populations are otherwise identical. (To make
the illustration cogent we can imagine that all families have the same
income.)

The word aggregation in the context of a statistical study or
construction of a mathematical model refers to the way the "subjects" of
the study or model are grouped in the analysis or subsequent application.
The degree to which subjects, "populations" or observations are combined
is called the level of aggregation.
As an example, census grouping by county, state, or region defines
three levels of aggregation. Similarly, stratifying population by family
incomes in $5,000/annum steps or $10,000/annum steps defines two levels of
aggregation. On the other hand, grouping the population into urban and
rural classes is a non-hierarchical or non-stratified kind of aggregation.
In this report we shall be concerned with stratified aggregations, that is,
with levels of aggregation.

We can see how aggregation can disguise the character of a population
by looking again at our first example. Let us flesh it out a little with
some definitions and some geography. We will say that incomes under
$5,000 are "low", from $5,000 to $9,999, "middle" incomes, and over $10,000,
"high". We will also locate our population on a little map.
Now recall that the mean income of what we will now call "area" A is
$5,000, the median income $2,000. Thus by one measure (the median) A is a
low income area, by another (the mean), a middle income area. In fact, it
can be split into a low and a high income population, and the aggregation
even at this low level distorts it. Again, with regard to the distribution
of incomes rather than measures like means and medians, we have added a
second "area" P'. At the level of aggregation in which we combine the
subsets, P and P' look quite alike in terms of income: (1, 2, 4, 5, 6, 12)
and (1, 2, 3, 6, 8, 10), in thousands of dollars, whereas in fact P
comprises a middle income area B and a mixed high-low area A, while P'
comprises the low income A' and a sort of upper middle area B'. Notice,
moreover, that at this aggregation the analyst looking at the "census
figure" level would deduce a slightly different image of the total area if
the boundary lines of the samples were the ones shown on the second
version of the "map".
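A short sketch makes the comparison of P and P' concrete. Several things here are assumptions rather than statements from the report: the split of P' into A' = (1, 2, 3) and B' = (6, 8, 10) is inferred from the description of A' as low income and B' as upper middle, an income of exactly $10,000 is treated as "high" because the stated classes leave that value unassigned, and the function names are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def income_class(amount):
    """Classify an income: under $5,000 low, $5,000 to $9,999 middle.
    Assumption: exactly $10,000 is counted as high."""
    if amount < 5000:
        return "low"
    if amount < 10000:
        return "middle"
    return "high"


def class_profile(incomes):
    """Count how many members of a group fall in each income class."""
    return Counter(income_class(amount) for amount in incomes)


A, B = [1000, 2000, 12000], [4000, 5000, 6000]
A_prime, B_prime = [1000, 2000, 3000], [6000, 8000, 10000]  # inferred split
P, P_prime = A + B, A_prime + B_prime

# Area A is "middle" by its mean but "low" by its median.
print(income_class(mean(A)), income_class(median(A)))

# Aggregated, P and P' show the same class counts ...
print(class_profile(P), class_profile(P_prime))

# ... although their component areas differ sharply.
for name, area in (("A", A), ("B", B), ("A'", A_prime), ("B'", B_prime)):
    print(name, dict(class_profile(area)))
```

Under these assumptions the aggregated profiles of P and P' coincide exactly, and the differences appear only when the component areas are examined separately.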
DATA SOURCES

Unfortunately, social researchers, unlike physical scientists, usually
are not able to acquire data for constructing and validating hypotheses
by conducting controllable experiments.

Perhaps this constitutes an overstatement of the difficulty. With
sufficient resources (in time and tangibles), a data collection program
specifically designed for the purpose of testing a hypothesis concerning
some human phenomenon is almost always possible. "Sufficient," however,
frequently means "very great". Consequently one usually turns to existing
tabulated data sources, chiefly the various compilations of statistics
by the U.S. Bureau of the Census. In the construction of a first stage
mathematical model of incidence of elevated blood lead levels (EBL) in
young children resulting from ingestion of lead paint, which is the subject
of this report, the effect of inconvenient sample boundaries has been a
particularly vexatious problem.

The details of the model are described in a report entitled "A Model
to Estimate the Incidence of Lead Paint Poisoning". It suffices here to
state that the hypothesis underlying the model is that the rate of
incidence of EBL in an area can be formally related to a) the fraction of
aged and deteriorating housing, which determines the level of the presence
of lead paint, b) representative socio-economic status, which is a
determinant of this environmental hazard, and c) the degree of
accessibility to contaminated paint by the "population at risk": children
up to six years of age.

In the absence of an independent data collection effort for lead
poisoning estimation, it is natural to inquire how well the Decennial
Census can serve as a data pool for the independent, i.e., the determining
variables, of a model with the underlying ideas as stated. Let us examine
the structure of census compilation.

The basic unit in compilation of census population data is the block
group in urban localities and the enumeration district elsewhere. Block
groups are, as the name implies, collections of "bunched" city blocks,
with aggregate population of about 1000. Enumeration districts are
roughly equivalent rural areas. Block groups (or sometimes enumeration
districts) are aggregated into clusters whose populations should be close
to 4000 according to Bureau of the Census demographers. These clusters are
called census tracts. Census tracts are the smallest subdivisions for
which computer tapes are available at present (November 1971) containing
the first count population and housing tabulations of the 1970 census.
Tract boundaries were drawn originally in 8 cities for the 1910
decennial census, and the number of tracted areas has increased
progressively since then. The early tracts were ethnically homogeneous,
probably because of the widespread interest in the image of the United
States as a "melting pot" of races and nationalities. Today responsibility
for delineations is vested in local tract commissions, assisted but not
bound by Bureau of the Census guidelines on population size and ethnic
and socio-economic homogeneity [1], with the result that the more than
50,000 current tracts in the U.S. are haphazard in character with
populations ranging from 1500 to 12000, and degrees of homogeneity largely
determined by the professional compositions of the (autonomous) tract
commissions.

Data requirements and other costs, i.e., the difficulties of handling
50,000 tracts rather than 250 metropolitan areas, discourage using a census
tract model for a first cut estimation of the national incidence of
"pediatric EBL".

On the other hand, incidence data have not been available up to this
time for the purpose of calibrating a model at a higher level of aggregation
(i.e., county, city or SMSA). This remark requires some amplification.

Data have been gathered in some 20 cities, but in addition to inherent
problems in compilation of information, we are confronted with disparate
surveys without common sampling methods, common definitions of EBL, or
common boundary criteria. (Some of the regions surveyed extended beyond
city lines without, however, encompassing SMSA's.)

[1] Census Tract Manual, Fifth Edition, U.S. Department of Commerce,
Bureau of the Census, Jan. 1966.

The Lead Poisoning Prevention and Control program in New Haven,
Connecticut included a survey of the city, in which incidences
were tabulated by census tract. This data base has been used in the
construction of a preliminary model.

Insofar as pathologies induced by the level of aggregation are
concerned, the results to date have been mixed.

There is unquestionably some "homogenization" difficulty. A study [2]
reported in the Connecticut Health Bulletin demonstrates that the
socio-economic character of neighborhoods in New Haven, as revealed by
census block group statistics, is submerged by the aggregation of the
groups into census tracts. We are certain that this has been a strong (but
not the only) factor in our inability to fit a model to the New Haven data
at the level of precision we desire. (This report is being produced prior
to attempts to validate any of our models by applying them to data from
another city, Aurora, Illinois, from which survey data, by tract, have
been promised to us.) On the other hand, the values predicted by these
models based on data aggregated over the entire city are no worse than
the tract predictions. That is, although the models are nonlinear, the
initial tests do not discourage the hope that one of them, if proven valid
over tracts, may be applicable to prediction at the city or SMSA level.

[2] E. Siker, J. Deshaies, S. Korper, and E. Stockwell, "Development of
a Community Wide Health Information System--Neighborhood Delineations
Socio-Economic Status," Connecticut Health Bulletin, Vol. 84, No. 9,
Sept. 1970.

Homogenizations resulting from aggregating populations may actually
improve the validity of model formulation in situations where there are
"compensating errors" in available data. A case in point is the use of
age of housing as a determining parameter for incidence of EBL. It is
often assumed that a sharp (and continuing) decline in the use of
lead-based paint in housing interiors, in favor of titanium-based paint,
commenced in 1940. Thus, the age of housing is a natural kind of parameter
to employ as a determinant of lead contamination. In particular, the
fraction of housing units erected before 1940, which is tabulated by the
census, is such a parameter. It could not be employed directly in
preliminary modelling attempts because 1970 age of housing tabulations are
part of the "4th count" census figures and will not be issued until
March 1972. Relating 1960 housing data to current EBL incidences cannot
be done reliably because of widespread "urban renewal" activities which
normally eventuate in wholesale demolitions.

However, if we must fall back on out-of-date statistics, use of
1960 age of housing figures at the SMSA level (with a crudely estimated
uniform attrition factor for old houses) would be vastly superior to
attempting a tract or fine subdivision model employing 1960 housing data,
in which errors of large magnitude in estimates of the age of dwelling
units are to be expected.